Since no access to the webserver can be given in this form of submission, this README.html provides the most essential information about the project, documents all steps taken, and hopefully answers any upcoming questions. The script itself is appended as well, but for the sake of transparency and clarity of presentation, only its most important parts are annotated. The general structure of the project is therefore presented here rather than in the notebook itself.
The main goal of the presented tool is to create a file containing all shared-flat advertisements published on one of the largest providers on the housing market.
In particular, the following tool automatically saves all offers presented on http://www.wg-gesucht.de for a given city (Constance, in this case). The collected data includes the room’s size, price, and price per square meter. Furthermore, the address and the household composition (by number of people and gender) of the shared flat are saved. The metadata for each ad include its unique ID, the link to the source, and the offer’s title description. Moreover, timestamps for activating as well as deactivating the ad are created, which allows estimating how long each ad stays active.
This collection of data is achieved by several processes:
The fully automated collection started on July 23.
The script is saved and run on a webserver. This guarantees an always-available resource that runs the script automatically and stores the emerging results.
Because it is simple and free of charge, bwCloud is used. There, a server is set up which runs the latest stable Ubuntu.
For the execution of the script, Python 3 had to be installed, as well as pip3. This is mandatory because of the packages and extensions used within the script.
The script is run automatically every four hours, starting at 8.50 am and ending at 8.50 pm. This is set up as a crontab entry, which looks as follows:
50 8,12,16,20 * * * /usr/bin/python3 /home/marius/python_project/project.py
It defines the minute (50) and the hours (8, 12, 16, 20) of the runtime, while the remaining fields identify the days, months, and weekdays (* * * implying all). Afterwards, the path to the Python interpreter is given (/usr/bin/python3), which can be found on the webserver using which python3. The last part of the crontab entry identifies the file that should be run and refers to the script itself.
In order to have a fixed structure in which the tool runs and which can be replicated more easily, the script creates its own directory environment.
As a consequence, nothing but the script itself is needed to start the collection and cleaning of data. The path under which the folders should be created is saved in the script as a separate string variable. This ensures that the script can be exported to other systems with only minor changes.
The directory ~/data/ is created.
It will contain three subdirectories and the most recent total dataframe.
The subdirectory ~/data/html/ is created.
This will contain one subdirectory per scraping run.
The subdirectory ~/data/raw/ is created.
In this, the data parsed from the HTML files will be stored.
The subdirectory ~/data/clean/ is created.
Here, the cleaned data from each parsing run is saved.
The subdirectory ~/data/html/today is created.
In this path, today is not the actual name of the directory but is replaced with the current timestamp of each script run, in the format YYYYMMDD-HHMM. In each of these sub-subdirectories, the scraped *.html files are saved. They can serve as a backup, since they are the only way to recreate the data of a given point in time if something fails during parsing.
The file 0_total.csv is saved at ~/data/.
At any given time, this file represents the dataset that contains the most recent information about all shared flats ever offered on the platform.
This is handled in greater detail in section 2.5 Appending.
The file dataframes.txt is created at ~/.
This document is used as a log of all cleaned datasets created while the script runs. Its usage and setup are explained in greater detail in section 2.4 Parsing and Cleaning.
The structure created this way results in an orderly environment and guarantees transparent and functional usage of the tool. For a better understanding, the structure is exemplified by the following snapshot:
tree -L 5
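The directory setup described above can be sketched roughly as follows; the base path and variable names are illustrative, not the script’s actual identifiers:

```python
import os
from datetime import datetime

# Base path stored as a separate string variable so the script
# can be moved to another system with only minor changes.
BASE = os.path.expanduser("~")

# Fixed directory skeleton: data/ with html/, raw/ and clean/ inside.
for sub in ("data", "data/html", "data/raw", "data/clean"):
    os.makedirs(os.path.join(BASE, sub), exist_ok=True)

# Per-run subdirectory under data/html/, named after the current
# timestamp in the format YYYYMMDD-HHMM.
run_dir = os.path.join(BASE, "data", "html",
                       datetime.now().strftime("%Y%m%d-%H%M"))
os.makedirs(run_dir, exist_ok=True)
```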
Every time the script runs, the website’s most recent 500 ads for shared flats in Constance are saved locally as *.html files. The number of saved results is defined in a separate variable so it can easily be changed (if wanted). It represents the number of pages to store, where each page consists of 20 results, all sorted by recency.
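This download step can be sketched as below. The URL pattern is a placeholder, not wg-gesucht.de’s real query scheme, and the script itself may use a different HTTP library; the stdlib is used here for self-containment:

```python
import os
import urllib.request

N_PAGES = 25  # separate variable: 25 pages x 20 results = 500 ads
RUN_DIR = "data/html/20190723-0850"  # per-run timestamp directory

def page_url(page):
    # Hypothetical URL pattern; the real one on wg-gesucht.de differs.
    return f"http://www.wg-gesucht.de/wg-zimmer-in-Konstanz.54.0.1.{page}.html"

def save_page(page):
    # Download one result page and store it locally as an .html file.
    html = urllib.request.urlopen(page_url(page)).read()
    with open(os.path.join(RUN_DIR, f"page_{page}.html"), "wb") as f:
        f.write(html)

# In the real script, a loop over all pages would run here:
# for page in range(N_PAGES):
#     save_page(page)
```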
In order to get the scraped data into a human-readable format that is easier to analyze, parsing is necessary. In this step, the most relevant information is extracted from the *.html files and transformed into a *.csv file.
To do so, an empty dataframe with the relevant columns is set up; the necessary columns and their content are explained below. Using the BeautifulSoup package, the HTML container that holds all actual shared-flat offers (and no promotions) is identified. Looping over it allows extracting the following features for each ad:
The geopy geocoder is used, on the one hand, to correct the spelling and the self-reported district of the flat and, on the other hand, to geocode the location, yielding geocoordinates instead of the plain address. This is necessary for plotting the results in the later graphical analysis.
Since active and inactive flats do not share the same structure, these extractions are nested within an if-else construction.
Each iteration of the loop over the containers produces a list of values that is used to write one row to the dataframe per ad.
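The parsing loop can be sketched as follows. The HTML snippet, class names, and column names are invented for illustration and do not match wg-gesucht.de’s real markup:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented markup standing in for a downloaded result page.
html = """
<div class="offer" data-id="101">
  <a href="/ad/101">Sunny room near the lake</a>
  <span class="size">18 qm</span><span class="price">350 EUR</span>
</div>
<div class="offer" data-id="102">
  <a href="/ad/102">Room in shared flat</a>
  <span class="size">12 qm</span><span class="price">290 EUR</span>
</div>
"""

# Empty dataframe with the relevant columns.
columns = ["id", "title", "link", "size", "price"]
df = pd.DataFrame(columns=columns)

soup = BeautifulSoup(html, "html.parser")
for container in soup.find_all("div", class_="offer"):
    # Build one list of values per ad and append it as a row.
    values = [
        container["data-id"],
        container.a.get_text(strip=True),
        container.a["href"],
        int(container.find("span", class_="size").text.split()[0]),
        int(container.find("span", class_="price").text.split()[0]),
    ]
    df.loc[len(df)] = values
```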
Since some of the extracted information may not be correct (due to misspelled names or an incorrect declaration of the district), further cleaning is necessary.
For this, a hard-coded list of the 18 valid districts is defined and used to filter and correct the given information. All rows that contain no proper address are then deleted.
This may seem like a harsh decision, but it is made with regard to the goal of the analysis: shared flats without a proper address cannot be used in any of the ways originally intended. Since this affects only very few observations (so far less than 5%), it is acceptable.
Afterwards, the datatypes of the columns are set according to the columns’ content.
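A condensed sketch of this cleaning step; the district list and column names are examples, not the script’s actual values:

```python
import pandas as pd

# Hypothetical shortlist standing in for the 18 hard-coded districts.
DISTRICTS = ["Altstadt", "Paradies", "Petershausen"]

df = pd.DataFrame({
    "district": ["Altstadt", "paradies", "Unknown"],
    "price":    ["350", "290", "410"],
})

# Normalize spelling, then drop rows whose district is not valid.
df["district"] = df["district"].str.capitalize()
df = df[df["district"].isin(DISTRICTS)].reset_index(drop=True)

# Set dtypes according to content; float (not int) for price because
# pandas' int64 cannot hold NaN/None values.
df["price"] = df["price"].astype(float)
```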
To finalize the process of data creation and manipulation, the columns are reordered and the dataframe is returned and saved as a *.csv file.
After exporting the data, the filename of each cleaned dataset is appended as a new line to dataframes.txt. This text file therefore serves as a log file as well as a list of all parsed datasets, which can be used to combine all parsed datasets into a master file containing the total amount of data, as presented in the next step.
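The log mechanism boils down to appending one filename per run; the filename below is illustrative:

```python
LOG = "dataframes.txt"

def log_dataframe(filename):
    # Append the name of the newly cleaned dataset as a new line.
    with open(LOG, "a") as f:
        f.write(filename + "\n")

def latest_dataframe():
    # The last line of the log names the most recent cleaned dataset.
    with open(LOG) as f:
        return f.read().splitlines()[-1]

log_dataframe("clean_20190723-0850.csv")
```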
On every execution of the script, the most recent cleaned dataframe is appended to the previously existing ‘total’ dataframe (0_total.csv).
This is guaranteed by using the automatically created log file of all cleaned datasets (dataframes.txt): the last line of that file is the name of the most recent cleaned dataset.
While appending, previously observed ads are checked for a change in their activity status. If there are no changes, the duplicates are dropped. If a formerly active ad has been turned off between two runs of the script, it receives a timestamp for inactivity and its boolean activity status is set to false. Since some people seem to turn their offer off and on again in order to reappear at the top of the list (the website orders all offers by recency of being online), the mechanism is implemented vice versa, too: an inactive offer that reappears with new data as active is set back to active, and its inactivity timestamp is reset to None.
Afterwards, the updated total dataframe overwrites the old one, so that 0_total.csv at any given time represents the dataset containing the most recent information about all shared flats ever offered.
Since the 0_total.csv file does not exist by default, it is created as a copy of the first cleaned dataframe when executing the script for the first time. This is done with shutil. Therefore, in every further runtime, the total file exists and satisfies the given requirements as stated above.
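The status-update logic during appending can be sketched like this, assuming columns named id, active, and inactive_since; the script’s actual column names and timestamps may differ:

```python
import pandas as pd

# Existing total dataframe and the newest cleaned dataframe.
total = pd.DataFrame({
    "id": [1, 2],
    "active": [True, True],
    "inactive_since": [None, None],
})
new = pd.DataFrame({
    "id": [2, 3],
    "active": [True, True],
    "inactive_since": [None, None],
})

now = "20190723-1250"

# Ads that no longer appear among the active offers are deactivated
# and receive an inactivity timestamp.
gone = ~total["id"].isin(new["id"])
total.loc[gone & total["active"], "inactive_since"] = now
total.loc[gone, "active"] = False

# Append the new data and drop duplicates, keeping the newer row, so
# a reactivated ad becomes active again with inactive_since reset.
total = (pd.concat([total, new])
           .drop_duplicates(subset="id", keep="last")
           .reset_index(drop=True))
```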
Even though not part of the process of data generation and manipulation, analysis is (or should be) the eventual goal of every project in social science. Since this tool focuses more on a proof of concept, multivariate analyses and complex methods are not provided.
Nevertheless, descriptive statistics and tabulations as well as good visualizations allow for a deep dive into the housing market of Constance and provide first insights into what is possible with computational approaches like the ones presented here.
Analysis is not part of the initial script but instead performed in a separate notebook in which the most crucial features of the housing market are examined:
The scraped and cleaned data has the following structure:
Analyzing how long each ad is active before being turned off, the following results emerge:
As can be seen from the graph, 50% of the ads are active between 55 and 145 hours (roughly 2 to 6 days). Nevertheless, some are closed after less than 24 hours while others stay online for a whole week. This shows how big the range and variation are, leading to further questions, which are discussed in greater detail in chapter 4.1 Potential of Data.
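The activity duration behind this graph is a simple difference between the two timestamps, sketched here with made-up values and illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "active_since":   ["20190723-0850", "20190724-1250"],
    "inactive_since": ["20190725-2050", "20190729-0850"],
})

# Parse the YYYYMMDD-HHMM timestamps created by the script.
fmt = "%Y%m%d-%H%M"
start = pd.to_datetime(df["active_since"], format=fmt)
end = pd.to_datetime(df["inactive_since"], format=fmt)

# Duration each ad was online, in hours.
df["hours_active"] = (end - start).dt.total_seconds() / 3600
```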
Analyzing the price (in Euro) by district, the results can be summarized with the following boxplots, showing the span as well as the median price for each location:
The y-axis displays the price of each ad; the ads are grouped by district on the x-axis.
In order to make the prices more comparable, they are converted into an estimate of euros per square meter. Analyzing this by district, the results can again be summarized with the following boxplots:
The y-axis displays the price per square meter of each ad; the ads are grouped by district on the x-axis.
Analyzing the size (in square meters) by district, the results can be summarized with the following boxplots, showing the span as well as the median size for each location:
The y-axis displays the size of each room; the rooms are grouped by district on the x-axis.
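All three boxplot figures follow the same pattern, sketched here with dummy data and an assumed column layout:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "district": ["Altstadt", "Altstadt", "Paradies", "Paradies"],
    "price":    [420.0, 380.0, 310.0, 290.0],
})

# One box per district: price on the y-axis, districts on the x-axis,
# with the district labels rotated for readability.
ax = df.boxplot(column="price", by="district", rot=90)
ax.set_ylabel("price in EUR")
plt.savefig("price_by_district.png")
```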
In order to get a visual overview of the distribution of rooms, maps can be used. This allows for a fast and comprehensive understanding of the geographical distribution and therefore leads to an intuitive recognition of ‘hotspots’ in Constance. As examples, two visualizations are presented, even though various other types would be possible:
Plotting every active ad as a blue circle and equipping it with a popup containing information such as the ad’s ID, price, and room size can be seen as a first step in visual exploratory data analysis.
A heatmap contains more generalized data and could therefore be very suitable for publication. Without showing the concrete address of each ad, it aggregates the locations into a distribution without losing the key feature: an easy-to-understand visualization of the flats’ geographical distribution.
In this section, a short overview is given of what future work on this tool and its analysis could contain. It not only highlights the chances and further potential of the given approach but also points out opportunities for improvement.
Raising the question “What more could be done with the created data?” leads to identifying potential use cases for scientific research:
* Can only be answered with additional data; all of this data is potentially available by expanding the scraping process.
Points of emphasis, in which the method should be further improved:
haupstrasse as a filter for Swiss addresses, but this is bad practice and should urgently be improved)

All files are zipped in 878725_Groen_ICSS.zip:
[1] Notebook: 00_Create_Modify_Data_878725_Groen.ipynb
[2] Notebook: 01_Analysis_Maps_878725_Groen.ipynb
[3] HTML copy: 00_Create_Modify_Data_878725_Groen.html
[4] HTML copy: 01_Analysis_Maps_878725_Groen.html
[5] Data: 0_total_0730_2050.csv
[6] Code: project.py
Create List of Results (instead of new Vars)
https://stackoverflow.com/questions/6181935/how-do-you-create-different-variable-names-while-in-a-loop
Creating a List of Classes
https://stackoverflow.com/questions/18725760/beautifulsoup-findall-given-multiple-classes
Extract Numbers from String
https://stackoverflow.com/questions/4289331/how-to-extract-numbers-from-a-string-in-python
Include Letters in Housenumber from Address-String
https://stackoverflow.com/questions/12572362/get-a-string-after-a-specific-substring
Cut Numbers from Street
https://stackoverflow.com/questions/33940917/how-can-i-remove-the-last-character-of-a-string-in-python?lq=1
Appending to DataFrame
https://stackoverflow.com/questions/49916371/how-to-append-new-row-to-dataframe-in-pandas
How to use Geopy Nominatim
https://gis.stackexchange.com/questions/293615/user-agent-argument-in-nominatim-in-geopy
Rotate axis-labels
https://stackoverflow.com/questions/10998621/rotate-axis-text-in-python-matplotlib
Use float for price instead of int due to pandas not handling NaN / None in int64
https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int
Enumerate in loop for adding n
https://stackoverflow.com/questions/2672936/multiple-counters-in-a-single-for-loop-python
Create directory with Python
https://thispointer.com/how-to-create-a-directory-in-python/
Find Duplicates
https://thispointer.com/pandas-find-duplicate-rows-in-a-dataframe-based-on-all-or-selected-columns-using-dataframe-duplicated-in-python/
Drop Duplicates
https://stackoverflow.com/questions/37047420/how-to-drop-rows-of-pandas-dataframe-with-same-value-based-on-condition-in-diffe
Reset Index in New Dataframe
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html